
150 research outputs found

GUIDANCE: a web server for assessing alignment confidence scores

Author: Castresana
D. Graur
E. Privman
G. Landan
Gatesy
Giribet
H. Ashkenazy
Katoh
Landau
Lassmann
Loytynoja
Neil
Nomaguchi
O. Penn
Poirot
Rambaut
Stoye
T. Pupko
Thompson
Wong
Publication venue: Oxford University Press

Evaluating the accuracy of multiple sequence alignment (MSA) is critical for virtually every comparative sequence analysis that uses an MSA as input. Here we present the GUIDANCE web-server, a user-friendly, open access tool for the identification of unreliable alignment regions. The web-server accepts as input a set of unaligned sequences. The server aligns the sequences and provides a simple graphic visualization of the confidence score of each column, residue and sequence of an alignment, using a color-coding scheme. The method is generic and the user is allowed to choose the alignment algorithm (ClustalW, MAFFT and PRANK are supported) as well as any type of molecular sequences (nucleotide, protein or codon sequences). The server implements two different algorithms for evaluating confidence scores: (i) the heads-or-tails (HoT) method, which measures alignment uncertainty due to co-optimal solutions; (ii) the GUIDANCE method, which measures the robustness of the alignment to guide-tree uncertainty. The server projects the confidence scores onto the MSA and points to columns and sequences that are unreliably aligned. These can be automatically removed in preparation for downstream analyses. GUIDANCE is freely available for use at http://guidance.tau.ac.il
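The idea of scoring each column by its robustness to perturbation can be illustrated with a minimal sketch. This is not the server's implementation: it assumes the perturbed alignments (for example, ones built from alternative guide trees) are already available as lists of gapped strings, and it scores a column as the fraction of its aligned residue pairs that are recovered in the perturbed alignments. The function names `aligned_pairs_by_column` and `column_confidence` are hypothetical.

```python
from itertools import combinations


def aligned_pairs_by_column(msa):
    """Return, for each column, the set of residue pairs aligned in it.

    A residue is identified by (sequence index, position in the ungapped
    sequence), so the same residue can be tracked across alignments.
    """
    next_position = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        residues = []
        for seq_index, row in enumerate(msa):
            if row[col] != "-":
                residues.append((seq_index, next_position[seq_index]))
                next_position[seq_index] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def column_confidence(base_msa, perturbed_msas):
    """Score each base column by how often its residue pairs reappear.

    Columns with fewer than two residues have no pairs and get None.
    """
    perturbed_pairs = [
        set().union(*aligned_pairs_by_column(msa)) for msa in perturbed_msas
    ]
    scores = []
    for pairs in aligned_pairs_by_column(base_msa):
        if not pairs or not perturbed_pairs:
            scores.append(None)
            continue
        hits = sum(1 for p in pairs for found in perturbed_pairs if p in found)
        scores.append(hits / (len(pairs) * len(perturbed_pairs)))
    return scores


base = ["AC-G", "A-CG"]
perturbed = [["AC-G", "A-CG"], ["AC-G", "ACG-"]]
print(column_confidence(base, perturbed))
```

In this toy input the first column is recovered in both perturbed alignments, while the last column is recovered in only one of them, so it receives a lower score and would be the first candidate for removal.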

Crossref

PubMed Central

HENA, heterogeneous network-based data set for Alzheimer's disease.

Author: Adler P.
Collura V.
Daudin R.
Dauvillier J.
Herault Y.
Hermjakob H.
Hindie V.
Lambert J.C.
Leontjeva A.
Loe-Mie Y.
Moncion T.
Peterson H.
Pupko T.
Rain J.C.
Simonneau M.
Sügis E.
Vilo J.
Xenarios I.
Publication venue: Springer Science and Business Media LLC
Publication date: 14/08/2019

Alzheimer's disease and other types of dementia are the top cause for disabilities in later life and various types of experiments have been performed to understand the underlying mechanisms of the disease with the aim of coming up with potential drug targets. These experiments have been carried out by scientists working in different domains such as proteomics, molecular biology, clinical diagnostics and genomics. The results of such experiments are stored in the databases designed for collecting data of similar types. However, in order to get a systematic view of the disease from these independent but complementary data sets, it is necessary to combine them. In this study we describe a heterogeneous network-based data set for Alzheimer's disease (HENA). Additionally, we demonstrate the application of state-of-the-art graph convolutional networks, i.e. deep learning methods for the analysis of such large heterogeneous biological data sets. We expect HENA to allow scientists to explore and analyze their own results in the broader context of Alzheimer's disease research

Serveur académique lausannois

HAL-Inserm

HAL Descartes

HAL-Pasteur

A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer

Author: A.B. Simonson
B. Boussau
B. Snel
B.E. Dutilh
B.G. Mirkin
C. Pál
C.G. Kurland
D.H. Huson
E. Belda
E.A. Herniou
E.D. Green
E.J. Deeds
E.L.L. Sonnhammer
E.V. Koonin
F. Delsuc
F. Tekaia
G.D.P. Clarke
G.P. Karev
I.K. Jordan
J. Lin
J.A. Lake
J.O. Korbel
J.P. Gogarten
J.T. Herbeck
K.H. Wolfe
M. Csűrös
M. Pellegrini
M.G. Montague
M.W. Hahn
R.L. Tatusov
S. Karlin
S. Yang
S.T. Fitz-Gibbon
T. Pupko
V. Kunin
W. Feller
W.J. Reed
X. Gu
Y. Boucher
Y.I. Wolf
Publication venue: Springer Science and Business Media LLC
Publication date: 27/09/2005

We introduce a Markov model for the evolution of a gene family along a phylogeny. The model includes parameters for the rates of horizontal gene transfer, gene duplication, and gene loss, in addition to branch lengths in the phylogeny. The likelihood for the changes in the size of a gene family across different organisms can be calculated in O(N+hM^2) time and O(N+M^2) space, where N is the number of organisms, h is the height of the phylogeny, and M is the sum of family sizes. We apply the model to the evolution of gene content in Proteobacteria using the gene families in the COG (Clusters of Orthologous Groups) database.
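The abstract does not give the rate equations. One common way to write a Markov model with these three parameters is a linear birth-death process with a constant gain term, sketched below as an assumption: n is the current family size, and the symbols λ (duplication rate per copy), μ (loss rate per copy) and κ (gain rate by horizontal transfer) are introduced here for illustration only.

```latex
\begin{aligned}
q_{n,\,n+1} &= n\lambda + \kappa && \text{(duplication of one of the } n \text{ copies, or gain by transfer)} \\
q_{n,\,n-1} &= n\mu             && \text{(loss of one of the } n \text{ copies)} \\
q_{n,\,m}   &= 0                && \text{for } |n - m| > 1
\end{aligned}
```

Under this form, a family of size zero can only be re-established through the transfer term κ, which is what separates horizontal gain from duplication.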

arXiv.org e-Print Archive

CiteSeerX

Crossref

Refining transcriptional regulatory networks using network evolutionary models and gene histories

Author: A Bhan
A Crombach
A Stark
A Tanay
AL Barabási
Bernard ME Moret
BME Moret
C Roth
CT Harbison
D Durand
DM Hillis
G Bourque
J Kim
J Yu
KP Murphy
L Arvestad
M Kanehisa
MM Babu
N Friedman
R Wang
RDM Page
S Liang
SA Teichmann
SY Kim
T Akutsu
T Chen
T Pupko
X Zhang
Xiuwei Zhang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Computational inference of transcriptional regulatory networks remains a challenging problem, in part due to the lack of strong network models. In this paper we present evolutionary approaches to improve the inference of regulatory networks for a family of organisms by developing an evolutionary model for these networks and taking advantage of established phylogenetic relationships among these organisms. In previous work, we used a simple evolutionary model and provided extensive simulation results showing that phylogenetic information, combined with such a model, could be used to gain significant improvements on the performance of current inference algorithms. Results: In this paper, we extend the evolutionary model so as to take into account gene duplications and losses, which are viewed as major drivers in the evolution of regulatory networks. We show how to adapt our evolutionary approach to this new model and provide detailed simulation results, which show significant improvement on the reference network inference algorithms. Different evolutionary histories for gene duplications and losses are studied, showing that our adapted approach is feasible under a broad range of conditions. We also provide results on biological data (cis-regulatory modules for 12 species of Drosophila), confirming our simulation results.

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Efficient algorithms for reconstructing gene content by co-evolution

Author: AK Hudek
C Dale
C Ouzounis
D Barry
D Juan
D Sankoff
D Wall
DM Hillis
E Eden
E Gaucher
F Hadlock
H Fraser
Hadas Birin
I Elias
J Felsenstein
J Forster
J Hacia
J Neyman
J Tauberberger
J Thornton
J W
J Zhang
L J
L Skrabanek
M Blanchette
M Garey
M Pagel
M Stoer
NM Krishnan
R Jovelin
R Robichaux
S Ghaemmaghami
S Tringe
T Jermann
T Jukes
T Pupko
T Sato
T Tuller
Tamir Tuller
V Pérez-Brocal
W Cai
W Fitch
X Zhang
Y Felder
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: In a previous study we demonstrated that co-evolutionary information can be utilized for improving the accuracy of ancestral gene content reconstruction. To this end, we defined a new computational problem, the Ancestral Co-Evolutionary (ACE) problem, and developed algorithms for solving it. Results: In the current paper we generalize our previous study in various ways. First, we describe new efficient computational approaches for solving the ACE problem. The new approaches are based on reductions to classical methods such as linear programming relaxation, quadratic programming, and min-cut. Second, we report new computational hardness results related to the ACE problem, including practical cases where it can be solved in polynomial time. Third, we generalize the ACE problem and demonstrate how our approach can be used for inferring parts of the genomes of non-ancestral organisms. To this end, we describe a heuristic for finding the portion of the genome ('dominant set') that can be used to reconstruct the rest of the genome with the lowest error rate. This heuristic utilizes both evolutionary information and co-evolutionary information. We implemented these algorithms on a large input of the ACE problem (95 unicellular organisms, 4,873 protein families, and 10,576 co-evolutionary relations), demonstrating that some of these algorithms can outperform the algorithm used in our previous study. In addition, we show that based on our approach a 'dominant set' can be used to reconstruct a major fraction of a genome (up to 79%) with a relatively low error rate (e.g. 0.11). We find that the 'dominant set' tends to include metabolic and regulatory genes, with high evolutionary rate, and low protein abundance and number of protein-protein interactions. Conclusions: The ACE problem can be efficiently extended for inferring the genomes of organisms that exist today. In addition, it may be solved in polynomial time in many practical cases. Metabolic and regulatory genes were found to be the most important groups of genes necessary for reconstructing gene content of an organism based on other related genomes.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Evolutionary Modeling of Rate Shifts Reveals Specificity Determinants in HIV-1 Subtypes

Author: A Bobkov
A Loytynoja
Adi Stern
AJ Leslie
B Julg
B Knudsen
B Korber
C Arvieux
C Blouin
CB Moore
CD Rizzuto
CS Alexander
D Middleton
D Moreira
D Pillay
DC Nickle
DL Robertson
DL Swofford
DT Jones
E Bohnlein
EA Gaucher
Eran Bacharach
F Abascal
F Gao
F Simon
F Williams
G McGuire
H Gatanaga
HM Berman
I Mayrose
J Dutheil
J Felsenstein
J Novembre
J Zhang
JJ Lum
JK Carr
Julien Dutheil
KS Dorman
L Bocket
L VerPlank
M Kimura
M Nei
MA Accola
MA Fares
MA Wainberg
MM Thomson
N Galtier
Nicolas Galtier
Nimrod D. Rubinstein
Osnat Penn
OV Tsodikov
P Lopez
PJ Goulder
Rob J. De Boer
RW Shafer
S Abhiman
S Guindon
S Miller
S Rusconi
SA Travers
SG Self
SL Kosakovsky Pond
T Bhattacharya
T Pupko
Tal Pupko
U Ranga
V Svicher
VA Johnson
VA Novitsky
VV Lukashov
VW Pollard
W Kabsch
WK Wang
WM Fitch
WP Bannister
X Gu
Y Kliger
Y Liu
Y Wang
Z Yang
Publication venue: Public Library of Science
Publication date: 01/11/2008

A hallmark of the human immunodeficiency virus 1 (HIV-1) is its rapid rate of evolution within and among its various subtypes. Two complementary hypotheses are suggested to explain the sequence variability among HIV-1 subtypes. The first suggests that the functional constraints at each site remain the same across all subtypes, and the differences among subtypes are a direct reflection of random substitutions, which have occurred during the time elapsed since their divergence. The alternative hypothesis suggests that the functional constraints themselves have evolved, and thus sequence differences among subtypes in some sites reflect shifts in function. To determine the contribution of each of these two alternatives to HIV-1 subtype evolution, we have developed a novel Bayesian method for testing and detecting site-specific rate shifts. The RAte Shift EstimatoR (RASER) method determines whether or not site-specific functional shifts characterize the evolution of a protein and, if so, points to the specific sites and lineages in which these shifts have most likely occurred. Applying RASER to a dataset composed of large samples of HIV-1 sequences from different group M subtypes, we reveal rampant evolutionary shifts throughout the HIV-1 proteome. Most of these rate shifts have occurred during the divergence of the major subtypes, establishing that subtype divergence occurred together with functional diversification. We report further evidence for the emergence of a new sub-subtype, characterized by abundant rate-shifting sites. When focusing on the rate-shifting sites detected, we find that many are associated with known function relating to viral life cycle and drug resistance. Finally, we discuss mechanisms of covariation of rate-shifting sites

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

Discovering local patterns of co-evolution: computational aspects and biological examples

Author: A Tanay
B Dujon
B Snel
C Goh
D Barker
D Chamovitz
D Juan
D Ober
D Scannell
DM Krylov
DP Wall
E Oron
F Pazos
I Wapinski
J Wu
JB MacQueen
K Wolfe
LM o Ramírez
M Benton
Martin Kupiec
O Man
P Jaccard
PM Bowers
R Chenna
R Singh
RL Tatusov
S Grossmann
S Ohno
T Przytycka
T Pupko
T Tuller
Tamir Tuller
TD Bie
Y Chena
Y Cheng
Yifat Felder
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Co-evolution is the process in which two (or more) sets of orthologs exhibit a similar or correlative pattern of evolution. Co-evolution is a powerful way to learn about the functional interdependencies between sets of genes and cellular functions and to predict physical interactions. More generally, it can be used for answering fundamental questions about the evolution of biological systems. Orthologs that exhibit a strong signal of co-evolution in a certain part of the evolutionary tree may show a mild signal of co-evolution in other branches of the tree. The major reasons for this phenomenon are noise in the biological input, genes that gain or lose functions, and the fact that some measures of co-evolution relate to rare events such as positive selection. Previous publications in the field dealt with the problem of finding sets of genes that co-evolved along an entire underlying phylogenetic tree, without considering the fact that often co-evolution is local. Results: In this work, we describe a new set of biological problems that are related to finding patterns of local co-evolution. We discuss their computational complexity and design algorithms for solving them. These algorithms outperform other bi-clustering methods as they are designed specifically for solving the set of problems mentioned above. We use our approach to trace the co-evolution of fungal, eukaryotic, and mammalian genes at high resolution across the different parts of the corresponding phylogenetic trees. Specifically, we discover regions in the fungi tree that are enriched with positive evolution. We show that metabolic genes exhibit a remarkable level of co-evolution and different patterns of co-evolution in various biological datasets. In addition, we find that protein complexes that are related to gene expression exhibit non-homogenous levels of co-evolution across different parts of the fungi evolutionary line. In the case of mammalian evolution, signaling pathways that are related to neurotransmission exhibit a relatively higher level of co-evolution along the primate subtree. Conclusions: We show that finding local patterns of co-evolution is a computationally challenging task and we offer novel algorithms that allow us to solve this problem, thus opening a new approach for analyzing the evolution of biological systems.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Prediction of binding hot spot residues by using structural and evolutionary parameters

Author: Ahmed S
Altschul SF
Apweiler R
Arkin MR
Ban YA
Bogan AA
Bradford JR
Chang CC
Clésio Luis Tozzi
Cristianini N
Darnell SJ
DeLano WL
Duda RO
Eisenberg D
el-Deiry WS
Fawcett T
Fernández-Recio J
Frishman D
Guney E
Hagerty CG
Hamelryck T
Hanley JA
Hastie T
Higa RH
Higgins D
Hu Z
Jones S
Kato S
Kidera A
Kirsch T
Koenderink JJ
Kortemme T
Li X
Liang J
Ma B
McIvor AM
Moreira IS
Neuvirth H
Platt J
Pupko R
Reddi AH
Res I
Roberto Hiroshi Higa
Rost B
Wesson L
Yuan C
Publication venue: Sociedade Brasileira de Genética
Publication date: 01/01/2009

In this work, we present a method for predicting hot spot residues by using a set of structural and evolutionary parameters. Unlike previous studies, we use a set of parameters which do not depend on the structure of the protein in complex, so that the predictor can also be used when the interface region is unknown. Despite the fact that no information concerning proteins in complex is used for prediction, the application of the method to a compiled dataset described in the literature achieved a performance of 60.4%, as measured by F-Measure, corresponding to a recall of 78.1% and a precision of 49.5%. This result is higher than those reported by previous studies using the same data set
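The relation between the reported precision, recall and F-Measure can be checked with a short calculation. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall, beta = 1); the abstract does not state which weighting or aggregation was used, so an exact match with the reported 60.4% is not expected. The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision <= 0 and recall <= 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Figures quoted in the abstract: precision 49.5%, recall 78.1%.
score = f_measure(precision=0.495, recall=0.781)
print(f"{score:.3f}")
```

With these inputs the balanced F-measure comes out at about 0.606, close to the reported value.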

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Directory of Open Access Journals

PubMed Central

Repositorio da Producao Cientifica e Intelectual da Unicamp

Background frequencies for residue variability estimates: BLOSUM revisited

Author: A del Sol Mesa
AG Murzin
C Sander
C Shannon
H Berman
I Mihalek
I Nooren
I Reš
J Donald
J Pei
K Pruitt
O Lichtarge
P Shenkin
R Development Core Team
R Edgar
S Altschul
S Henikoff
S Jones
S Kullback
S Veerassamy
T Pupko
W Atchley
W Valdar
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Shannon entropy applied to columns of multiple sequence alignments as a score of residue conservation has proven one of the most fruitful ideas in bioinformatics. This straightforward and intuitively appealing measure clearly shows the regions of a protein under increased evolutionary pressure, highlighting their functional importance. The inability of the column entropy to differentiate between residue types, however, limits its resolution power. Results: In this work we suggest generalizing Shannon's expression to a function with similar mathematical properties, that, at the same time, includes observed propensities of residue types to mutate to each other. To do that, we revisit the original construction of BLOSUM matrices, and re-interpret them as mutation probability matrices. These probabilities are then used as background frequencies in the revised residue conservation measure. Conclusion: We show that joint entropy with BLOSUM-proportional probabilities as a reference distribution enables detection of protein functional sites comparable in quality to a time-costly maximum-likelihood evolution simulation method (rate4site), and offers greater resolution than the Shannon entropy alone, in particular in the cases when the available sequences are of narrow evolutionary scope.
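The baseline measure that this work starts from, plain Shannon entropy of an alignment column, can be sketched as follows. The BLOSUM-based generalization proposed in the paper is not reproduced here. Ignoring gap characters and using base-2 logarithms are assumptions of this sketch, and the function name `column_entropy` is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Gap characters are ignored; a column with only gaps scores 0.0.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


conserved = column_entropy("AAAA")   # one residue type: lowest entropy
mixed = column_entropy("AACC")       # two equally frequent residue types
variable = column_entropy("ACDE")    # four different residue types
print(conserved, mixed, variable)
```

Low entropy marks a conserved column. Note that the measure treats every substitution alike, so a column mixing chemically similar residues scores the same as one mixing dissimilar residues, which is the limitation the abstract points to.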

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Protein binding hot spots and the residue-residue pairing preference: a water exclusion perspective

Author: A Fernández
AA Bogan
CJ Tsai
E Guney
F Glaser
G Moont
H Ponstingl
H Zhu
I Halperin
IM Nooren
ISS Moreira
J Li
J Martin
J Mintseris
Jinyan Li
JL Morrison
KS Thorn
L Lo Conte
N Tuncbag
O Keskin
P Chakrabarti
P Privalov
Q Liu
Qian Liu
RP Bahadur
RP Saha
S De
S Jones
S Lukman
S Miyazawa
SJ Hubbard
T Clackson
T Pupko
WL DeLano
WSJ Valdar
Y Ofran
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: A protein binding hot spot is a small cluster of residues tightly packed at the center of the interface between two interacting proteins. Though a hot spot constitutes a small fraction of the interface, it is vital to the stability of protein complexes. Recently, a series of hypotheses has been proposed to characterize binding hot spots, including the pioneering O-ring theory, the insightful 'coupling' and 'hot region' principle, and our 'double water exclusion' (DWE) hypothesis. As the perspective changes from the O-ring theory to the DWE hypothesis, we examine the physicochemical properties of the binding hot spots under the new hypothesis and compare them with those under the O-ring theory. Results: The requirements for a cluster of residues to form a hot spot under the DWE hypothesis can be mathematically satisfied by a biclique subgraph if a vertex is used to represent a residue, an edge to indicate a close distance between two residues, and a bipartite graph to represent a pair of interacting proteins. We term these hot spots DWE bicliques. We identified DWE bicliques from crystal packing contacts, obligate and non-obligate interactions. Our comparative study revealed that there are abundant bicliques unique to the biological interactions, indicating specific biological binding behaviors in contrast to crystal packing. The two sub-types of biological interactions also have their own signature bicliques. In our analysis of residue compositions and residue pairing preferences in DWE bicliques, the focus was on interaction-preferred residues (ipRs) and interaction-preferred residue pairs (ipRPs). It is observed that hydrophobic residues are heavily involved in the ipRs and ipRPs of the obligate interactions, and that aromatic residues are favored in the ipRs and ipRPs of the biological interactions, especially in those of the non-obligate interactions. In contrast, the ipRs and ipRPs in crystal packing are dominated by hydrophilic residues, and most of the anti-ipRs of crystal packing are the ipRs of the obligate or non-obligate interactions. Conclusions: These ipRs and ipRPs in our DWE bicliques describe diverse binding features among the three types of interactions. They also highlight the specific binding behaviors of the biological interactions, sharply differing from the artifact interfaces in the crystal packing. It can be noted that DWE bicliques, especially the unique bicliques, can capture deep insights into the binding characteristics of protein interfaces.
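The graph formulation in the Results can be made concrete with a small check of the biclique property alone. This sketch assumes the inter-protein contacts have already been computed and are given as a set of (residue on protein A, residue on protein B) pairs; any further conditions of the DWE hypothesis are not modeled. The residue labels and the function name `is_biclique` are hypothetical.

```python
def is_biclique(left, right, contacts):
    """True if every residue in `left` is in contact with every residue in `right`.

    `contacts` holds the edges of the bipartite graph as (left, right) pairs.
    Both sides must be non-empty.
    """
    if not left or not right:
        return False
    return all((u, v) in contacts for u in left for v in right)


# Toy interface: residues A1..A3 on one protein, B1..B2 on the other.
contacts = {
    ("A1", "B1"), ("A1", "B2"),
    ("A2", "B1"), ("A2", "B2"),
    ("A3", "B1"),
}

complete = is_biclique({"A1", "A2"}, {"B1", "B2"}, contacts)
incomplete = is_biclique({"A1", "A3"}, {"B1", "B2"}, contacts)
print(complete, incomplete)
```

The second candidate fails because A3 and B2 are not in contact, so the cluster is not fully connected across the interface.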

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

OPUS - University of Technology Sydney

PubMed Central